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Abstract 

Missing data estimation is an important challenge with high-dimensional data 
arranged in the form of a matrix. Typically this data matrix is transposable, mean- 
ing that either the rows, columns or both can be treated as features. To model 
transposable data, we present a modification of the matrix-variate normal, the 
mean-restricted matrix-variate normal, in which the rows and columns each have 
a separate mean vector and covariance matrix. By placing additive penalties on 
the inverse covariance matrices of the rows and columns, these so called trans- 
posable regularized covariance models allow for maximum likelihood estimation of 
the mean and non-singular covariance matrices. Using these models, we formu- 
late EM-type algorithms for missing data imputation in both the multivariate and 
transposable frameworks. We present theoretical results exploiting the structure 
of our transposable models that allow these models and imputation methods to be 
applied to high-dimensional data. Simulations and results on microarray data and 
the Netflix data show that these imputation techniques often outperform existing 
methods and offer a greater degree of flexibility. 



1 Introduction 

As large datasets have become more common in biological and data mining applications, 
missing data imputation is a significant challenge. We motivate missing data estimation 
in matrix data with the example of the Netflix movie rating data (j2j. This dataset 
has around 18,000 movies (columns) and several hundred thousand customers (rows). 
Customers have rated some of the movies, but the data matrix is very sparse with a only 
small percentage of the ratings present. The goal is to predict the ratings for unrated 
movies so as to better recommend movies to customers. The movies and customers, 
however, are very correlated and an imputation method should take advantage of these 
relationships. Customers who enjoy horror fllms, for example, are likely to rate movies 
similarly, in the same way that horror fllms are likely to have similar ratings from these 
customers. Modeling the ratings by the relationships between only the movies or only 
the customers as with multivariate methods and fc-nearest neighbor methods, seems 
shortsighted. Customer A's rating of Movie 1, for example, is related to Customer B's 
rating of Movie 2 by more than simply the connection between Customer A and B or 
Movie 1 and 2. In addition, modeling ratings as a linear combination of the ratings 
of movies or a combination of customer ratings as with singular value decomposition 
(SVD) methods fails to capture a more sophisticated connection between the movies 
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and customers PU|) . Bell et. al, in their discussion of imputation for the Netflix data, 
call all of these methods either "movie-centric" or "user-centric" (|T|) . 

We propose to directly model the correlations among and between both the cus- 
tomers (rows) and the movies (columns) . Thus, our model is transposable in the sense 
that it treats both the rows and columns as features of interest. The model is based 
on the matrix- variate normal distribution brought to our attention by Efron which 
has separate covariance matrix parameters for both the rows and the columns. Thus, 
both the relationships between customers and between movies are incorporated in the 
model. If matrix- variate normal data is strung out in a long vector, then it is distributed 
as multivariate normal with the covariance related to the original row and column co- 
variance matrices through their Kronecker product. This means that the relationship 
between Customer A's rating of Movie 1 and Customer B's rating of Movie 2 can be 
modeled directly as the interaction between Customers A and B and Movies 1 and 2. 

In practice, however, transposable models based on the matrix-variate normal distri- 
bution have largely been of theoretical interest and have rarely applied to real datasets 
because of the computational burden of high-dimensional parameters (|17p . In this pa- 
per, we introduce modifications of the matrix-variate normal distribution, specifically 
restrictions on the means and penalties on the inverse covariances, that allow us to fit 
this transposable model to a single matrix of data. The penalties we employ give us non- 
singular covariance estimates that have connections to the singular value decomposition 
and graphical models. With this theoretical foundation, we present computationally 
efficient Expectation Maximization-type (EM) algorithms for missing data imputation. 
We also develop a two-step process for calculating conditional distributions and an al- 
gorithm for calculating conditional expectations of scattered missing data that has the 
computational cost of comparable multivariate methods. These contributions allow one 
to fit this parametric transposable model to a single data matrix at reasonable compu- 
tational cost, opening the door to numerous applications including user-ratings data. 

We organize the paper beginning with a review of the multivariate regularized co- 
variance models (RCM) and a new imputation method based on these models. Section 
|2] The RCMs form the foundation for the transposable regularized covariance models 
(TRCM) introduced in Section Is] We then present new EM- type imputation algorithms 
for transposable data. Section ^ along with a one-step approximation in Section |4.2[ 
Simulations and results on microarray and the Netflix data are given in Section [s] and 
we conclude with a discussion of our methods in Section |6l 

2 Regularized Covariance Models and Imputation with 
Multivariate Data 

Several recent papers have presented algorithms and discussed applications of reg- 
ularized covariance models (RCM) for the multivariate normal distribution (fT4| l32|l . 
These models regularize the maximum likelihood estimate of the covariance matrix by 
placing an additive penalty on the inverse covariance or concentration matrix. The re- 
sulting estimates are nonsingular, thus enabling covariance estimation when the number 
of features is greater than the number of observations. In this section, we give a review 
of these models and briefly describe a new penalized EM algorithm for imputation of 
missing values using the regularized covariance model. 

Let Xi ^ N{0, A) ioi i — 1 . . .n, i.i.d. observations and p features. Thus, our data 
matrix, X is n x p with covariance matrix A £ W^^. The penalized log- likelihood of 
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the regularized covariancc model is then proportional to 

^(A) = |log| A-^ I - itr(X^X A"^) - p|| A"^ ||' (1) 

2 

where || • H'^ = X^iLi I ' 1"^ ^^'^ 9 either 1 or 2, i.e. the sum of the absolute value or 
square of the elements of A"^. The penalty parameter is p. With an L2 penalty, we can 
write the penalty term as p tr(A"-^ A"^) = p\ \ A"^ |||. 

Maximizing f(A) gives the penalized-maximum likelihood estimate (MLE) of A. 
Friedman et. al (|14p present the graphical lasso algorithm to solve the problem with an 

Li penalty. The graphical lasso uses the lasso method iteratively on the rows of A , 

and gives a sparse solution for A . A zero in the ij*'' component of A"^ implies that 
variables i and j are conditionally independent given the other variables. Thus, these 
penalized-maximum likelihood models with Li penalties can be used to estimate sparse 
undirected graphs. With an L2 penalty, the problem has an analytical solution (3T). If 
we take the singular value decomposition (SVD) of X, X = U D V'^, with d = diag(D), 
then 

A = Vdiag(.)V-, e. = 'l±^S±I^, (2) 

Thus, the inclusion of the L2 penalty simply regularizes the eigenvalues of the covari- 
ancc matrix. When p > n and letting r be the rank of X, the final n — r values of 9 
are constant and are equal to l^fpjn. While a rank-/c SVD approximation uses only 
the first k eigenvalues, the L2 RCM gives a covariance estimate with all non-zero eigen- 
values. Regularized covariance models provide an alternative method of estimating the 
covariance matrix with many desirable properties (i27j) . 

With this underlying model, we can form a new missing data imputation algorithm 
by maximizing the observed penalized log-likelihood of the regularized covariance model 
via the EM algorithm. Our method is the same as that of the EM algorithm for the 
multivariate normal described in Little and Rubin (1211), except for an addition in the 
maximization step. In our M step, we find the MLE of the RCM covariance matrix 
instead of the multivariate normal MLE. Thus, our method fits into a class of penalized 
EM algorithms which give non-singular covariance estimates (tl6|, thus enabling use 
of the EM framework when p > n. We give full details of the algorithm, which we 
call RCMimpute, in [A] As we will discuss later, this imputation algorithm is a special 
case of our algorithm for transposable data and forms an integral part of our one-step 
approximation algorithm presented in Section |4.2[ 



3 Transposable Regularized Covariance Models 

As previously mentioned, we model the possible dependencies between and within the 
rows and columns using the matrix-variate normal distribution. In this section, we first 
present a modification of this model, the mean-restricted matrix-variate normal distri- 
bution. We confine the means to limit the total number of parameters and to provide 
interpretable marginal distributions. We then introduce our transposable regularized 
covariance models by applying penalties to the covariances of our matrix-variate distri- 
bution. Finally, we present the penalized-maximum likelihood parameter estimates and 
illustrate the connections between these estimates and those of multivariate models, the 
singular value decomposition and graphical models. 
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3.1 Mean- Restricted Matrix- variate Normal Distribution 



We introduce the mean-restricted matrix-variate normal, a variation on the matrix- 
variate normal, presented by Gupta and Nagar (fTT]! . A restriction on the means is 
needed because the matrix-variate normal has a mean matrix, M, of the same dimension 
as X, meaning that there are nxp mean parameters. Since the matrix-variate normal is 
mostly applied in instances where there are several independent samples of the random 
matrix X (Sj), this parameter formulation is appropriate. We propose, however, to use 
the model when we only have one matrix X from which to estimate the parameters. 
Also, we wish to parameterize our model so that the marginals are multivariate normal, 
thus easing computations and improving interpret-ability. 

We denote the mean-restricted matrix-variate normal distribution by X ~ iVn,p (j^, A) 
with X S SR"^^, the row mean G 3?", the column mean fi G W, the row covariance 
S e 3?"^" and the column covariance A e 3?^^^. If we place the matrix X into a 
vector of length np, we have vec(X) ~ N (vec(M), fJ) where M — J^l^) + and 
r2 = A (8) S. Thus, our mean-restricted matrix-variate normal model is a multivariate 
normal with a mean matrix composed of additive elements from the row and column 
mean vectors and a covariance matrix given by the Kronecker product between the row 
and column covariance matrices. This covariance structure can been seen as a tensor 
product Gaussian process on the rows and columns, an approach explored in (gl . 

This distribution implies that a single element, Xij has mean Ui + jij along with 
variance Ha ^jj, a mean and variance component from the row and column to which it 
belongs. As pointed out by a referee, this can be viewed as the following random effects 
model: Xij — Vi+fj^j+eij, where e^j ~ ^'^(O, Ha ^jj), which has two additive fixed effects 
depending on the row and column means and a random effect whose variance depends 
on the product of the corresponding row and column covariances. This model shares 
the same first and second moments as elements from the mean-restricted matrix-variate 
normal. It does not, however, capture the Kronecker covariance structure between the 
elements of X unless both the row and column covariances, S and A are diagonal. This 
random effect model differs from the more common two-way random effects model with 
additive errors, which assumes that errors from the two sources are independent. Our 
model, however, assumes that the errors are related and models them as an interaction 
effect. A similar random effects approach was taken in (|34p . also using a Kronecker 
product covariance matrix. 

To further illustrate the model, we note that the rows and columns are both marginally 
multivariate normal. The z*'* row, denoted as Xj^, is distributed as, Xi^. ~ N {vi + /i, A) 
and the column denoted by X^j, is distributed as X^j ^ N {v + Hj,Ajj S) . The fa- 
miliar multivariate normal distribution is a special case of the mean-restricted matrix- 
variate normal as seen by the following two statements. If S = I and v — Q then, 
X N{fi,A), and if A = I and ^ = then, X ^ N{i/,T,). Also, two elements 
from different rows or columns are distributed as a bivariate normal, {Xij, Xi'ji) ^ 

N ( ( '^^ ^ 1,1 I I . Thus, our model is more general than 

the multivariate normal, with the flexibility to encompass many different marginal mul- 
tivariate models. 

For completeness, the density function of this distribution is 

p(;/,^,S,A) = (27r)-T | E | A]"? 

etrf -^(X-!/lfp) - A"^(X-!/lfp) - l(„)^i^)^S"M , 
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where etr() is the exponential of the trace function. Hence, our formulation of the 
matrix- variate normal distribution adds restrictions on the means, giving the distribu- 
tion desirable properties in terms of its marginals and easing computation of parameter 
estimates, discussed in the following section. 

3.2 Transposable Regularized Covariance Model (TRCM) 

In the previous section, we have reformulated the distribution to limit the mean pa- 
rameters and in this section, we regularize the covariance parameters. This allows us to 
obtain non-singular covariance estimates which are important for use in any application, 
including missing data imputation. 

As in the multivariate case, we seek to penalize the inverse covariance matrix. Instead 
of penalizing the overall covariance, rj, we add two separate penalty terms, penalizing 
the inverse covariance of the rows and of the columns. The penalized log-likelihood is 
thus 

f(!.,M,S,A) = |log|S-^| + ^log|A-^ 

- ^tr (^S"^(X-!/lfp) - l(„)/i^) A"^(X-!/lfp) - 

-prWS-'W"- -pc\\A-^\\''. (3) 

2 

where || • ||''' = X^I^i I ' 1'"^ ^^"^ 1r ^^'^ Qc are either 1 or 2, i.e. the sum of the 
absolute value of the matrix elements or squared elements, pr and pc are the two 
penalty parameters. Note that we will refer to the four possible types of penalties as 
Lq^ : Lq^. Placing separate penalties on the two covariance matrices is not equivalent 
to placing a single penalty on the Kronecker product covariance matrix 17. Using two 
separate penalties gives greater flexibility, as the covariance of the rows and columns can 
be modeled separately using differing penalties and penalty parameters. Also, having 
two penalty terms leads to simple parameter estimation strategies. 

With transposable regularized covariance models, as with their multivariate counter- 
part, the penalties are placed on the inverse covariance matrix, or concentration matrix. 
Estimation of the concentration matrix has long been associated with graphical models, 
especially with an Li penalty which is useful to model sparse graphical models (14J. 
Here, a non-zero entry of the concentration matrix, S.^ ^ 0, means that the row 
conditional on all other rows is correlated with row j. Thus, a "link" is formed in the 
graph structure between nodes i and j. Conversely, zeros in the concentration matrix 
imply conditional independence. Hence, since we are estimating both a regularized row 
and column concentration matrix, our model can be interpreted as modeling both the 
rows and columns with a graphical model. 

3.3 Parameter Estimation 

We estimate the means and covariances via penalized maximum likelihood estima- 
tion. The estimates, however, are not unique, but the overall mean, M, and overall 
covariance Ct are unique. Hence, v and jl are unique up to an additive constant and S 
and A are unique up to a multiplicative constant. We first begin with the maximum 
likelihood estimation of the mean parameters. 

Proposition 1. The MLE estimates for v and p, are 



where Xcj denotes the j*'' column and Xir the z*^ row ofX-E 3?"^^. 
Proof: See\^ 

The estimates for v and fi are obtained by centering with respect to the rows and 
then the columns. Note that centering by the columns first will change jl and v, but will 
still give the same additive result. Thus, the order in which we center is unimportant. 

Maximum likelihood estimation of the covariance matrices is more difficult. Here, we 
will assume that the data has been centered, M = 0. Then, the penalized log-likelihood, 
A), is a bi-concave function of S"^ and A"^. In words, this means that for any 
fixed S"^', A) is a concave function of A"^, and for any fixed A"^', ^(S, A') is a 

concave function of S"^. We exploit this structure to maximize the penalized likelihood 
by iteratively maximizing along each coordinate, either S"^ or A"^. 

Proposition 2. Iterative block coordinate-wise maximization o/£(5],A) with respect 
to S"^ and A"^ converges to a stationary point of £{H, ^) for both Li and L2 penalty 
types. 

Proof: 5*66 

While block coordinate- wise maximization ( Proposition |2]) reaches a stationary point 
of £{11, A), it is not guaranteed to reach the global maximum. There are potentially 
many stationary points, especially with Li penalties, due to the high-dimensional nature 
of the parameter space. We also note a few straightforward properties of the coordinate- 
wise maximization procedure, namely that each iteration monotonically increases the 
penalized log- likelihood and the order of maximization is unimportant. 

The coordinate-wise maximization is accomplished by setting the gradients with 
respect to S"^ and A"^ equal to zero and solving. We list the gradients with L2 penalties. 
With Li penalties only the third term is changed and is given in parentheses. 



Maximization with Li penalties can be achieved by applying the graphical lasso al- 
gorithm to the second term with the coefficient of the third term the as the penalty 
parameter. With L2 penalties, we maximize by taking the eigenvalue decomposition of 
the second term and regularizing the eigenvalues as in the multivariate case, (|2|. Thus, 
coordinate-wise maximization leads to a simple iterative algorithm, but it comes at a 
cost since it does not necessarily converge to the global maximum. When both penalty 
terms are L2 penalties, however, we can find the global maximum. 

3.3.1 Covariance Estimation for L2 Penalties 

Covariance estimation when both penalties of the transposable regularized covari- 
ance model are L2 penalties reduces to a minimization problem involving the eigenvalues 
of the covariance matrices. This problem has a unique analytical solution, and thus our 
estimates, S and A, are globally optimal. 

Theorem 1. The global unique solution maximizing ^(S, A) with L2 penalties on both 
covariance parameters is given by the following: Denote the SVD ofK as X = UD V"'" 
with d = diag(D) and let r be the rank of^, then 



d£ 




(5) 



S* = Udiag(/3*)U'^ & A* = Vdiag(r) V'^ 



(6) 
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where (3* G W+ and 0* e W+ given by 

'2,/^ tfi>r, 



'2/^ ifi>r 



otherwise. y p/if'' -4,pr 



otherwise. 



with coefficients 

= -4pcP^, 4*' = ?,2prpcP + di{n ~p), & 4'' = 4pr(d^ - WprPc). 
Proof See\^ 

With L2 penalties, maximum likelihood covariance estimates S and A have eigen- 
vectors given by the left and right singular vectors of X respectively. To reveal some 
intuition as to how these covariance estimates compare to other possible eigenvalue reg- 
ularization methods, we present the two gradient equations in terms of the eigenvalues 
jS and 9. (These are discussed fully in the proof of Theorem [ij . 

pOiP^ - d^.Pi ~ 4/5,6', = & nfte? - 46», - 4pcft = 0. 

These are two quadratic functions in /? and 9, so the quadratic formula gives us the 
eigenvalues in terms of each other. We see that the eigenvalues regularize the square 
of the singular values by a function of the dimensions, the penalty parameters and 
the eigenvalues of the other covariance estimate. From Theorem [l| L2 ■ L2 covariance 
estimation has a unique and globally optimal solution, which cannot be said of the other 
combinations of penalties. We give numerical results comparing our TRCM covariance 
estimates to other shrinkage covariance estimators in[B] 

Here, we also pause to compare our TRCM model with L2 penalties to the singular 
value decomposition model commonly employed with matrix data. If we include both 
row and column intercepts, we can write the rank- reduced SVD model as Xij = Vi+fij + 
DrVj -I- e, where and vj are the and j*'' right and left singular vectors, is 
the rank-reduced diagonal matrix of singular values and e ~ N{0, a^). Thus, the model 
appears similar to L2 TRCM, which can be written as Xij — Vi + fij + tij where e^j 
N{0, uf dia.g{f3)ui * vjdiag(6')vj). There are important differences between the models, 
however. First, the left and right singular vectors are incorporated directly into the SVD 
model whereas they form the bases of the variance component of TRCM. Secondly, a 
rank-reduced SVD incorporates only the first r left and right singular vectors. Our 
model uses all the singular vectors as f3 and 9 are of lengths n and p respectively. 
Finally, the SVD allows the covariances of the rows to vary with , whereas with TRCM 
the rows share a common covariance matrix. Thus, while the SVD and TRCM share 
similarities, the models differ in structure and hence each offers a separate approach to 
matrix-data. 



4 Imputation for Transposable Data 

Imputation methods for transposable data are the main focus of this paper. We 
formulate methods based on the transposable regularized covariance models introduced 
in Section [3j Because computational costs have limited use of the matrix- variate normal 
in applications, we let computational considerations motivate the formulation of our 
imputation methods. 

We propose a Multi-Cycle Expectation Conditional Maximization (MCECM) algo- 
rithm, given by Meng and Rubin (,23,) . maximizing the observed penalized log-likelihood 
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of the transposable regularized covariance models. The algorithm exploits the structure 
of our model by maximizing with respect to one block of coordinates at a time, saving 
considerable mathematical and computational time. First, we develop the algorithm 
mathematically, provide some rationale behind the structure of the algorithm via nu- 
merical examples, and then briefly discuss computational strategies and considerations. 

In high dimensional data, however, the MCECM algorithm we propose for impu- 
tation is not computationally feasible. Hence, we suggest a computation-saving one- 
step approximation in Section |4.2[ The foundation of our approximation lies in new 
methods, given in Theorems |2] and [3] for calculating conditional distributions with the 
mean-restricted matrix-variate normal. We also demonstrate the utility of this one-step 
procedure in numerical examples. A Bayesian variation of the one-step approximation 
using Gibbs sampling is given in[D| 

Prior to formulating the imputation algorithm for transposable models, we pause 
to address a logical question: Why do we not use the multivariate imputation method 
based on regularized covariance models, given that the mean-restricted matrix normal 
distribution can be written as a multivariate normal with vec(X) ^ N (vec(M.) , ft)! 
There are two reasons why this is inadvisable. First, notice that TRCMs place an 
additive penalty on both the inverse covariance matrices of the rows and the columns. 
The overall covariance matrix, ft, however, is their Kronecker product. Thus, converting 
the TRCM into a multivariate form yields a messy penalty term leading to a difficult 
maximization step. The second reason to avoid multivariate methods is computational. 
Recall that ^1 is a, np x np matrix which is expensive to repeatedly invert. We will 
see that the mathematical form of the ECM imputation algorithm we propose leads to 
computational strategies that avoid the expensive inversion of il. 



4.1 Multi-Cycle ECM Algorithm for Imputation 

Before presenting the algorithm, we first review the notation used throughout the 
remainder of this paper. As previously mentioned, we use i to denote the row index and 
j the column index. The observed and missing parts of row i are Oi and respectively, 
and Oj and rrij are the analogous parts of column j. We let m and o denote the totality 
of missing and observed elements respectively. Since with transposable data there is no 
natural orientation, we set n to always be the larger dimension of X and p the smaller. 



4.1.1 Algorithm 

We develop the ECM-type algorithm for imputation mathematically, beginning with the 
observed data log-Hkelihood which we seek to maximize. Letting x^.j — S~.^^^^(a;oj ,j — 



(7) 



-1 ll-Jr 



Pell A 



1 ||9c 



One can show that this is indeed the observed log-likelihood by starting with the 
multivariate observed log-likelihood and using vec(X) and the corresponding vec(M) 
and ft. We maximize ([t]) via an EM- type algorithm which, similarly to the multivariate 
case, gives the imputed values as a part of the Expectation step. 
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We present two forms of the E step, one which leads to simple maximization with re- 
spect to S"^ and the other with respect to A"^. This is possible because of the structure 
of the matrix- variate model, specifically the trace term. Letting 9 = {i^, /it, S, A}, the 
parameters of the mean-restricted matrix- variate normal, and letting o be the indices of 
the observed values, the E step, denoted by Q{9\9' ,Xo), has the following form. Here, 
we assume that X is centered. 



Q{e\e',Xo) = E {t{v, n, E, A)\Xo, e') oc E [tr (X^ E-i X A-') \Xo, 9' 

oc tr [e (x.'^ E"^ X \Xo, e'^ A"^ 
oc tr [e (x A-^ X^ \Xo, f') E"^ 

Thus, we have two equivalent forms of the conditional expectation which we give below. 
Proposition 3. The E step is proportional to the following form 

E [tr {x^ E"^ X A"^) \Xo, 6*'] = tr \^{x^ E"^ X + G(E"^ 

= tr [(^XA"^X^ + F(A"^ 



(8) 



whereX = 'E{X\Xo,e') and 
I tr(C(")E-i) .. 

G(E-i)= : 

\ tr(C(''i) E-i) .. 

/ tr(D(")A-i) . 

F(A-i) = : 

\ tr(D("^) A-i) . 
Proof: 5*66 fGl 



tr(C(iP)E-i) \ 

tr(C(ff'E-^) / 
tr(D(i") A-i) 

tr(D("") A-') 



C'^"'^ =Co'^{X,,,X,,,\Xo,e'), 



D("'' =Cov(X.„X,vl^o,6'')■ 



The E step in the matrix- variate normal framework has a similar structure to that of 
the multivariate normal (see|A| with an imputation step (X) and a covariance correction 
step (d^""^ and D^"')). The matrices C^^'^''^ e and D^"') e JR^^p, while G(S-i) e 

W^'P and F(A"^) e 3?"^". Note that C*^-'^'^ is sparse and only non-zero at c|-/ when 
Xij and Xiiy are both missing. C^-'-' is not symmetric, but C^-'-' = C'--' -'-', hence 
G(S"^) is symmetric. The matrices D*-" and F(A"^) are structured analogously. Thus, 
we have two equivalent forms of the E step, which will be inserted between the two 
Conditional Maximization (CM) steps to form the MCECM algorithm. 

The CM steps which maximize the conditional expectation functions, in Proposition 
[3] along either S"^ or A"^ are direct extensions of the MLE solvers for the multivariate 
RCMs. This is easily seen from the gradients. Note that we only show the gradients 
with an L2 penalty, since an L\ penalty differs only in the last term. 



dQ 

aE 

dQ 
dA- 



- = E - [x A"^ X^ + F(A"^; 



4pr 



[X^ E-^X + G(E-^)] /n-^A-^ 



E-V 



With an L2 penalty, the estimate is given by taking the eigenvalue decomposition of the 
second term and regularizing the eigenvalues as in (|2]). The graphical lasso algorithm 
applied to the second term gives the estimate in the case with an Li penalty. 

We now put these steps together and present the Multi-Cycle ECM algorithm for 
imputation with transposable data, TRCMimpute, in Algorithm [l] A brief comment 
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Algorithm 1 Imputation with Transposable Regularized Covariance Models (TR- 
CMimpute) 

1. Initialization: 

(a) Estimate v and /t from the observed data. 

(b) If Xij is missing, set Xij — i>i + flj. 

(c) Start with non-singular estimates S and A. 

2. E Step (A): Calculate X^S"^X+ G(S"Y 

3. M Step (A): 

(a) Update estimates of v and fi. 

(b) Maximize Q with respect to A"^ to obtain A. 

4. E Step (S): Calculate XA^^x'^ + F(A"^), 

5. M Step (S): 

(a) Update estimates of v and fi. 

(b) Maximize Q with respect to S"^ to obtain S. 

6. Repeat Steps 2-5 until convergence. 



regarding the initialization of parameter estimates is needed. Estimating the mean 
parameters when missing values are present is not as simple as centering the rows and 
columns as in Q. Instead, we iterate centering by rows and columns, ignoring the 
missing values by summing over the observed values, until convergence. Secondly, the 
initial estimates of S and A must be non-singular in order to preform the needed 
computations in the E step. While any non-singular matrices will work, we find that the 
algorithm converges faster if we start with the MLE estimates with the missing values 
fixed and set to the estimated mean. Some properties and numerical comparisons of 
the MCECM algorithm are given in[C| 



4.1.2 Computational Considerations 

We have presented our imputation algorithm for transposable data, TRCMimpute, but 
have not yet discussed the computations required. Calculation of the terms for the two 
E steps can be especially troublesome and thus we concentrate on these. Particularly, we 
need to find X = E (X |Xo, 0') , and the covariance terms, C^-'-' = Cov {X^j , Xcj' |Xo, 9') 

and D*-" ' = Cov (X^r, -^^i'rl^o, ^'')- The simplest but not always the most efficient 
way to compute these is to use the multivariate normal conditional formulas with the 
Kronecker covariance matrix il, i.e. if we let m be the indices of the missing values of 
vec(X) and o be the observed, 

/v^ fvec(M)fc + Jlfc„rj-„^(vec(X)„-vec(M)o) if fc G m 
vec(X)fc = <^ (9) 
I vec(A)fe it fc £ o 
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and, the non-zero elements of C and D corresponding to covariances between pairs of 
missing values come from 

Gov (vec(X)m,vec(X)m|Xo, 61') = fl mm ^mo^oo ^om- (-^O) 

This computational strategy may be appropriate for small data matrices, but even when 
n and p are medium-sized, this approach can be computationally expensive. Inverting 
n can be of order 0{n^p^), depending on the amount of missing data. So, even if 
we have a relatively small matrix of dimension 100 x 50, this inversion costs around 
©(lO^'^)! Using Gibbs sampling to approximate the calculations of the E steps in either 
a Stochastic or Stochastic Approximation EM-type algorithm (|6]) is one computational 
approach (we present Gibbs sampling as part of our Bayesian one-step approximation 
in|D]). A stochastic approach, however, is still computationally expensive and thus, an 
approximation to our MCECM algorithm is needed. 

4.2 One-Step Approximation to TRCMimpute 

For high-dimensional transposable data, the imputation algorithm, TRCMimpute, 
can be computationally prohibitive. Thus, we propose a one-step approximation which 
has a computational costs comparable to multivariate imputation methods. 

4.2.1 One-step Algorithm 

The MCECM algorithm for imputation with transposable regularized covariance mod- 
els iterates between the E step, taking conditional expectations, and the CM steps, 
maximizing with respect to the inverse covariances. Both of these steps are computa- 
tionally intensive for high-dimensional data. While each iterate increases the observed 
log-likelihood, the first step usually produces the steepest increase in the objective. 
Thus, we propose an algorithm that instead of iterating between E and CM steps, 
approximates the solution of the MCECM algorithm by stopping after only one step. 

Many have noted in other iterative maximum likelihood-type algorithms that a one- 
step algorithm from a good initial starting point often produces an efficient, if not 
comparable, approximation to the fully-iterated solution (20] llip . Thus, for our one- 
step approximation we seek a good initial solution from which to start our CM and E 
steps. For this, we turn to the multivariate regularized covariance models. Recall that 
all marginals of the mean-restricted matrix-variate normal are multivariate normal, 
and hence, if one of the penalty parameters for the TRCM model is infinitely large, 
we obtain the RCM solution (i.e. if pr = oo, we get the RCM solution with penalized 
covariance among the columns) . We propose to use the estimates from the two marginal 
distributions with penalized row covariances and penalized column covariances to obtain 
our initial starting point. This is similar to the COSSO one-step algorithm which uses 
a marginal solution as a good initial starting point, (|20|). 

Since the final goal of our approximation algorithm is missing value imputation, and 
not parameter estimation, we then tailor our one-step algorithm to favor imputation. 
First, instead of using the marginal RCM covariance estimates as starting values for the 
subsequent TRCMimpute E and CM steps, we use the marginal estimates to obtain two 
sets of imputed missing values through applying the RCMimpute method to the rows 
and then the columns. We then average the two sets of missing value estimates and 
fix these to find the maximum likelihood parameter estimates for the TRCM model, 
completing the maximization step. In summary, our initial estimates are obtained by 
applying an EM-type method to the marginal models. Biernacki et al. (j3j) similarly 
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use other EM-type algorithms to find good initial starting values for their EM mixture 
model algorithm. The final step of our algorithm is the Expectation step where we 
take the conditional expectation of the missing values given the observed values and 
the TRCM estimates. Note that the E step of the MCECM algorithm includes both an 
imputation part and a covariance correction part (see Proposition [3|. For our one-step 
algorithm, however, the covariance correction part is unnecessary since our final goal is 
missing value imputation. We give the one-step approximation, called TRCMAimpute, 
in Algorithm [2] 

Algorithm 2 One-step algorithm approximating TRCMimpute (TRCMAimpute) 

1. Initial imputation: 

(a) Impute missing values with RCMimputc assuming S = I. 

(b) Impute missing values with RCMimpute assuming A = I. 

(c) Average the two estimates. 

2. Find the MLE's of the transposable regularized covariance model, v, /t, S and A 
with the imputed missing values fixed. 

3. Set the missing values to their conditional expectation given these parameters: 
X™ = E (^X„ I Xo, i>, fi, S, A^ . 



Before discussing the calculations necessary in the final step of the algorithm, we 
pause to note a major advantage of our one-step method. If the sets of missing values 
from the marginal models using RCMimpute are saved in the first step, then TRC- 
MAimpute can give three sets of missing value estimates. Since it is often unknown 
whether a given dataset may have independent rows or columns, cross validation, for 
example, can be used to determine whether penalizing the covariances of the rows, 
columns, or both is best for missing value imputation. This is discussed in detail in[F| 



4.2.2 Conditional Expectations 

We now discuss the final conditional expectation step of our one-step approximation 
algorithm. Recall that the conditional expectation can be computed via (|9|, but this 
requires inverting Q, and is therefore avoided. Instead, we exploit a property of the mean- 
restricted matrix- variate normal, namely that all marginals of our model are multivariate 
normal. This allows us to find the conditional distributions in a two step process given 
by Theorem [2] 

Theorem 2. Let X 7V„,p {v, ^, S, A), M = vl^ + /i^l and partition X, M, S, A 
as 

X = I --"^ 1 = I ^'"-^ ) , M = f ) =(M,,, M,,), 

where i and j denote indices of a row and column respectively, k and I are vectors of 
indices of length n ~ I and p ~ I respectively, and rui and Oi denote vectors of indices 



Xi,m^ 


Xi.o^ 
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within row i and ruj and Oj indices within column j . 
Define 

r = [S,,, - Si,;, Sfe/^ Sfc,i] ® A, & * = S ® [Aj, J - Aj,, A-^j Aij] . 



Partition 



-0, 77, r and * as 7/; = ^ ^ , = (77,„^. r?oJ, 

\r r i' l* * 

y ■■- o j ,m j -■- Oj ,0 j y y ^ oj ,m,j ^Oj,Oj 

Then, 

^ N (jprrii H" mi,Oi'^ Oi,Oi {^^i,Oi V'o-i ) ; '^rrii ,mi fni^Oi^ Oi^oi^ Oi^mi) ■ 

(h) (X„^.j |Xo^^.,Xc,z) 



Proo/; 5*66 

Thus, from Theorem[2] the conditional distribution of values in a row or column given 
the rest of the matrix can be calculated in a two step process where each step takes at 
most the number of computations as required for calculating multivariate conditional 
distributions. The first step finds the distribution of an entire row or column conditional 
on the rest of the matrix, and the second step finds the conditional distribution of the 
values of interest within the row or column. By splitting the calculations in this manner, 
we avoid inverting the np x np Kronecker product covariance. This alternative form 
for the conditional distributions of elements in a row or column leads to an iterative 
algorithm for calculating the conditional expectation of the missing values given the 
observed values. We call this the Alternating Conditional Expectations Algorithm, 
given in Algorithm [Sj 

Algorithm 3 Alternating Conditional Expectations Algorithm. 

1. Initialize x| — Ui + pj for X^j- G X„j. 

2. For each row, i, with missing values: 

(a) Set X ■ „^ = E (Xi^^, \ 'X.^'^l > ^^:^lr 

3. For each column, j, with missing values: 

(a) Set<t;^=E(x,„^,|xi;)^,xg^. 

4. Repeat Steps 1 and 2 until convergence. 



Theorem 3. Let X ^ iV„,p {i^, /i, S, A) and partition 

vec(X) — (vec(Xm) vec(Xo)) where m and a are indices partitioned by rows, (rrii and 
Oi) and columns, (mj and Oj), so that a row, X^^^ — (Xi,m. Xi^oJ and a column Xcj — 
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( "^."'^ ) . Then, the Alternating Conditional Expectations Algorithm, Algorithm 
\ J / 

converges to E (X„i | Xo). 
Proof: 5*66 [Gj 

Theorem [3] shows that the conditional expectations needed in Step 3 of the one-step 
approximation algorithm can be calculated in an iterative manner from the conditional 
distributions of elements in a row and column, as in Algorithm |3] Thus, Theorems [2] 
and [3] mean that the conditional expectations can be calculated by separately invert- 
ing the row and column covariance matrices, instead of the overall Kronecker product 
covariance. This reduces the order of operations from around 0{n^p^) to 0{n^ + P^), 
a substantial savings. In addition, if both the covariance estimates and their inverses 
are known, then one can use the properties of the Schur complement to further speed 
computation. For extremely sparse matrices or data with few missing elements, the 
order of operations is nearly linear in n and p (See|E|. We also note that the structure 
of the Alternating Conditional Expectations Algorithm often leads to a faster rate of 
convergence as discussed in the proof of Theorem [3j For high-dimensional data, these 
two results mean that matrix-variate models can be used in any application where mul- 
tivariate models are computationally feasible, thus opening the door to applications of 
transposable models! 



4.2.3 Numerical Comparisons 

We now investigate the accuracy of the one-step approximation algorithm in terms of 
observed log-likelihood and imputation accuracy with a numerical example. Here, we 
simulate fifty datasets, 25 x 25, from the matrix-variate normal model with autoregres- 
sive covariance matrices. 

• Autoregressive: = 0.8l*~-'l and A^^ = 0.6l'--'l. 

We delete values at random according to certain percentages and report the mean MSE 
for both the MCECM algorithm, TRCMimpute, and the one-step approximation, TRC- 
MAimpute on the right in Figure [l] The one-step approximation performs comparably, 
or slightly better, in terms of imputation error to the MCECM algorithm for all percent- 
ages of missing values. We note that TRCMAimpute could give better missing value 
estimates if the MCECM algorithm converges to a sub-optimal stationary point of the 
observed log-likelihood. For a dataset with 25% missing values, we apply the MCECM 
algorithm and also apply our approximation extended beyond the first step, but denote 
the observed log-likelihood after the first step with a star on the right in Figure [T| 
This shows that using marginals to provide a good starting value, does indeed start the 
algorithm at a higher observed log-likelihood. Also, after the first step, the observed 
log-likelihood is very close to the fully-iterated maximum. Thus, the one-step approxi- 
mation appears to be a comparable approximation to the TRCMimpute approximation 
which is feasible for use with high-dimensional datasets. 



5 Results and Simulations 

The following results indicate that imputation with transposable regularized co- 
variance models is useful in a variety of situations and data types often giving much 
better error rates than existing methods. We first assess the performance of our one- 
step approximation, TRCMAimpute, under a variety of simulations with both full and 
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Figure 1: Comparison of the mean MSE with standard errors (left) of the MCECM im- 
putation algorithm (TRCMimpute) and the one-step approximation (TRCMAimpute) 
for transposable data of dimension 25 x 25 with various percentages of missing data. 
Fifty datasets were simulated from the matrix-variate normal distribution with autore- 
gressive covariances as given in Section 4.2.3 Observed log-likelihood (right) verses 



iterations for TRCMimpute and TRCMAimpute with 25% missing values. The one- 
step approximation begins at the TRCM parameter estimates using the imputed values 
from RCMimpute. The observed log-likelihood of one-step approximation is given by a 
star after the first step. All methods use L2 penalties with p,. = pc = I for comparison 
purposes. 



sparse covariance matrices. Authors have suggested that microarrays and user-ratings 
data, such as the Netflix movie-rating data are transposable or matrix-distributed (flUjl . 
dU, hence, we also assess the performance of our methods on these types of datasets. 
We compare performances to three commonly used single imputation methods - SVD 
methods (SVDimpute), fc-nearest neighbors (KNNimpute), and local least squares (LL- 
Simpute) (j5U|) . ^TE^ . For the SVD method, we use a reduced rank model with a column 
mean effect. The rank of the SVD is determined by cross-validation; regularization is 
not used on the singular vectors so that only one parameter is needed for selection by 
cross-validation. For fc-nearest neighbors and local least squares also, a column mean 
effect is used and the number of neighbors, fc, is selected via cross-validation. If the 
number of observed elements is limited, the pairwise-complete correlation matrix is used 
to determine the closest neighbors. 

5.1 Simulations 

We test our imputation method for transposable data under a variety of simulated 
distributions, both multivariate and matrix-variate. All simulations use one of four 
covariance types given below. These are numbered as they appear in the simulation 
table. 

1. Autoregressive: = 0.8l'--'l and A,j = 0.6^''^^. 
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TRCMAinipute 


Others 




Li : L2 


L2 '■ -^2 


SVD 


KNN 


Si 


0.8936 (0.01) 
45/50 


0.725 (0.0069) 
0/50 


0.5919 (0.0056) 
50/50 


0.634 (0.0081) 


0.448 (0.005) 


Si, Ai 


0.8255 (0.012) 
0/50 


0.6315 (0.0078) 
0/50 


0.5402 (0.0067) 
0/50 


0.4603 (0.0083) 


0.8034 (0.016) 


S2 


0.895 (0.016) 
43/50 


0.7829 (0.013) 
0/50 


0.6392 (0.008) 

48/50 


0.993 (0.019) 


0.9498 (0.017) 


S2, A2 


0.749 (0.044) 
0/50 


0.6867 (0.034) 
0/50 


0.4556 (0.0098) 
48/50 


0.6821 (0.051) 


0.8273 (0.055) 


S3 


1.04 (0.017) 
37/50 


1.02 (0.017) 
9/50 


0.9348 (0.016) 

49/50 


0.7384 (0.012) 


0.9115 (0.014) 


S3, A3 


1.012 (0.02) 
5/50 


0.9477 (0.019) 
0/50 


0.8585 (0.017) 
37/50 


0.7271 (0.016) 


0.9886 (0.019) 


S4 


0.9986 (0.018) 
37/50 


0.9407 (0.017) 
3/50 


0.8067 (0.014) 

48/50 


0.4903 (0.0076) 


0.9057 (0.014) 


S4, A4 


0.9726 (0.033) 
6/50 


0.855 (0.028) 
0/50 


0.6999 (0.022) 
39/50 


0.5282 (0.024) 


0.9366 (0.031) 


Si 


0.9134 (0.0096) 
21/50 


0.9083 (0.0092) 

18/50 


0.8948 (0.009) 
21/50 


1.173 (0.013) 


0.9349 (0.0092) 


Si, Ai 


0.867 (0.011) 
0/50 


0.8569 (0.01) 
0/50 


0.845 (0.0096) 
0/50 


0.9535 (0.01) 


0.9736 (0.013) 


S3 


1.053 (0.01) 
7/50 


1.052 (0.01) 
7/50 


1.048 (0.01) 
7/50 


1.22 (0.013) 


1.03 (0.01) 


S3, A3 


1.001 (0.014) 
0/50 


0.9968 (0.014) 
0/50 


0.9945 (0.014) 

1/50 


1.11 (0.016) 


1.006 (0.014) 



Table 1: Mean MSE with standard error computed over 50 datasets of dimension 50 x 50 
simulated under the matrix-variate normal distribution with covariances given in Section 
5.1 In the upper portion of the table, 25% of values are missing and in the lower, 75% 
missing. The TRCM one-step approximation with Li : Li, Li : Li and Li : Li penalties 
was used as well as the SVD and k-nearest neighbor imputation. Below the errors for 
TRCMAimpute, we give the number of simulations out of 50 in which a marginal, mul- 
tivariate method (RCMimpute) was chosen over the matrix-variate method. Parameters 
were chosen for all methods via 5-fold cross validation. Best performing methods are 
given in bold. 



2. Equal ofF-diagonals: — 0.5 and A^- = 0.5 for i 7^ j, and = 1 and A^; = 1. 

3. Blocked diagonal: Ha — 1 and An = 1 with ofF-diagonal elements of 5 x 5 blocks 
of S are 0.8 and of A, 0.6. 

4. Banded ofF-diagonals: Ha — 1 and A^^ 1 with 

J 0.8 if \i — j\ divisible by 5. J 0.6 if \i — j\ divisible by 5. 

I otherwise. I otherwise. 

The first simulation, with results in Table [l] compares performances with both mul- 
tivariate distributions, only S given, and matrix-variate distributions, both S and A 
given. In these simulations, the data is of dimension 50 x 50 with either 25% or 75% of 
the values missing at random. The simulation given in Table [2] gives results for matrix- 
variate distributions with one dimension much larger than the other, 100 x 10 and 10% 
of values missing at random. The final simulation, in Table |3] tests the performance 
of our method when the data has a transposable covariance structure, but is not nor- 
mally distributed. Here, the data, of dimension 50 x 50 with 25% of values missing 
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at random, is either distributed Chi-square with three degrees of freedom or Poisson 
with mean three. The Chi-square and Poisson distributions introduce large outliers and 
the Poisson distribution is discrete. All three sets of simulations are compared to SVD 
imputation and fc-nearest neighbor imputation. 





TRCMAimpute 


Others 


L2 ■ Li 


L2 : L2 


SVD 


KNN 


Si, Ai 


0.8227 (0.019) 


0.7072 (0.016) 


1.075 (0.024) 


0.6971 (0.018) 


5:2, A2 


1.019 (0.15) 


0.9441 (0.13) 


1.306 (0.23) 


1.057 (0.17) 


S3, A3 


0.9372 (0.047) 


0.841 (0.042) 


1.121 (0.05) 


0.9241 (0.042) 


S4, A4 


0.7044 (0.059) 


0.6148 (0.049) 


0.9751 (0.074) 


1.118 (0.089) 



Table 2: Mean MSB with standard errors over 50 datasets of dimension 100 x 10 with 
10% missing values simulated under the matrix-variate normal with covariances given in 
Section 5.1 The TRCM one-step approximation with L2 : Li and L2 '■ L2 penalties was 
used as well as the SVD and k-nearest neighbor imputation. Parameters were chosen 
for all methods via 5-fold cross validation. Best performing methods are given in hold. 





TRCMAimpute 


Others 


L2 : Lj 


L2 '• 1^2 


KNN 


SVD 


Chi-square 


Si, Ai 


3.824 (0.065) 


2.611 (0.044) 


6.85 (0.34) 


7.684 (0.15) 


S3, A3 


5.525 (0.14) 


5.068 (0.15) 


29.41 (0.83) 


50.16 (0.74) 


Poisson 


Si, Ai 


2.442 (0.05) 


1.571 (0.021) 


8.04 (0.34) 


5.824 (0.11) 


S3, A3 


3.045 (0.075) 


2.813 (0.081) 


29.13 (0.95) 


49.2 (0.68) 



Table 3: Mean MSB with standard error computed over 50 datasets of dimension 50 x 50 
with 25% missing values simulated under the Chi-square distribution with 3 degrees of 
freedom or the Poisson distribution with mean 3 with Kronecker product covariance 
structure given by the covariances in Section 5.1 The TRCM one-step approximation 
with L2 '. Li and B2 '. L2 penalties was used as well as the SVD and k-nearest neighbor 
imputation. Parameters were chosen for all methods via 5-fold cross validation. Best 
performing methods are given in bold. 



These simulations show that TRCMAimpute is competitive with two of the most 
commonly used single imputation methods, SVD and fc-nearest neighbor imputation. 
First, TRCM with B2 penalties outperforms the other possible TRCM penalty types. 
This may be due to the fact that the covariance estimates with B2 penalties has a glob- 
ally unique solution. Theorem [T] while the estimation procedure for other penalty types 
only reaches a stationary point. Proposition [2j The one-step approximation permits the 
flexibility to choose cither multivariate or transposable models. As seen with smaller 
percentages of missing values, cross-validation generally chooses the correct model for 
the Bi : Bi penalty-type, but seems to prefer the marginal multivariate models for the 
B2 ■ B2 penalties. However, with 75% of the values missing, the transposable model is 
often chosen even if the underlying distribution is multivariate. The additional structure 
of the TRCM covariances may allow for more information to be gleaned from the few 
observed values, perhaps explaining the better performance of the matrix-variate model. 
TRCMAimpute seems to perform best in comparison to SVD and fc-nearest neighbor 
imputation for the full covariances with equal off-diagonal elements. Our TRCM-based 
imputation methods appear particularly robust to departures from normality and per- 
form well even in the presence of large outliers, as shown in Table[3j Overall, imputation 
methods based on transposable covariance models compare favorably in these simula- 



17 



tions. 



5.2 Microarray Data 

Microarrays are high-dimensional matrix-data that often contain missing values. Usu- 
ally, one assumes that the genes are correlated while the arrays are independent. Efron 
questions this assumption, however, and suggests using a matrix-variate normal model 
©. Indeed, the matrix-variate framework, and more specifically the TRCM model seem 
appropriate models for microarray data for several reasons. First, one usually centers 
both the genes and the arrays before analysis, a structure which is built in to our model. 
Second, TRCMs have the ability to span many models which include a marginal model 
where the rows are distributed as a multivariate normal and the arrays are indepen- 
dent. Hence, if a microarray is truly multivariate, our model can accommodate this. 
But, if there are true correlations within the arrays, TRCM can appropriately measure 
this correlation and account for it when imputing missing values. Lastly, the graphical 
nature of our model can estimate the gene network and then use this information to 
more accurately estimate missing data. 

For our analysis, we use a microarray dataset of kidney cancer tumor samples p5|l . 
The dataset contains 14,814 genes and 178 samples. About 10% of the data is missing. 
For the following figures, all of the genes with no missing values were taken, totaling 
1,031 genes. Missing values were then placed at random. Errors were assessed by 
comparing the imputed values to the true observed values. 

We assess the performance of TRCM imputation methods on this microarray data 
and compare them to existing methods for various percentages of missing values, deleted 
at random, on the right in Figure [2j Here, we use L2 penalties since these are compu- 
tationally less expensive for high-dimensional data. TRCMAimpute outperforms com- 
peting methods in terms of imputation error for all percentages of missing values. We 
note that cross-validation exclusively chose the marginal, multivariate model from the 
one-step approximation. This indicates that the arrays in this microarray dataset may 
indeed by independent. 

Often, microarray datasets are not missing randomly. Also, researchers are interested 
in not only the error in terms of MSE, but the individual errors made as well. To 
investigate these issues, we assess individual absolute errors of data that is missing 
in the same pattern as the original data. For each complete gene, values were set 
to missing in the same arrays as a randomly sampled gene from the original dataset. 
The right panel of Figure [2] displays the boxplots of the absolute imputation errors. 
Lines are drawn at quantiles to assess the relative performances of each method. Here, 
TRCMAimpute has lower absolute errors at each quantile. Also, the set of imputed 
values has far fewer outliers than competing methods. The mean absolute error for 
TRCMAimpute is 0.37, far below the next two methods, LLSimpute and SVDimpute 
which have a mean absolute error of 0.46. Altogether, our results illustrate the utility 
and flexibility of using TRCMs for missing value imputation in microarray data. 

5.3 Netflix Data 

We compare transposable regularized covariance models and existing methods on the 
Netflix movie rating data ([2]). The TRCM framework seems well-suited to model this 
user-ratings data. As discussed in the introduction, our model allows for not only 
correlations among both the customers and movies, but between them as well. In 
addition, TRCM models the graph structure of the customers and the movies. Thus, 
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L2:L2 SVD LLS KNN 

Proportion of missing values (at random) 

Figure 2: Left: Comparison of MSE for imputation methods on kidney cancer mi- 
croarray data with different proportions of missing values. Genes in which all samples 
are observed are taken with values deleted at random. TRCMAimpute, L2 : L2 and 
common imputation methods KNNimpute, SVDimpute and LLSimpute are compared 
with all parameters chosen by 5-fold cross-validation. Cross-validation chose to penalize 
only the arrays for the one-step approximation algorithm, TRCMAimpute. Right: Box- 
plots of individual absolute errors for various imputation methods. Genes in which all 
samples are observed were taken and deleted in the same pattern as a random gene in 
the original dataset. Lines are drawn at the 50%, 75%, 95%, 99%, and 99.9% quantiles. 
TRCMAimpute, Li : L2 has a mean absolute error of 0.37 and has lower errors at every 
quantile than its closest competitor, SVDimpute which has a mean absolute error of 
0.46. 



we can fill in a customer's rating of a particular movie based on the customer's links 
with other customers and the movie's links with other movies. Also, many have noted 
that the unrated movies in the Netflix data are not simply missing at random and may 
contain meaningful information. A customer, for example, may not have rated a movie 
because the movie was not of interest and thus they never saw it. While it may appear 
that our method requires a missing at random assumption, this is not necessarily the 
case. When two customers have similar sets of unrated movies, after removing the 
means, our algorithm begins with the unrated movies set to zero. Thus, these two 
customers would exhibit high correlation simply due to the pattern of missing values. 
This correlation could yield an estimated "link" between the customers in the inverse 
covariance matrix. This would then be used to estimate the missing ratings. Hence, our 
method can find relationships between sets of missing values and use these to impute 
the missing values. 

The Netflix dataset is extremely high-dimensional, with over 480,000 customers and 
over 17,000 movies, and is very sparse, with over 98% of the ratings missing. Hence, 
assessing the utility of our methods from this data as a whole is not currently feasible. 
Instead, we rank both the movies and the customers by the number of ratings and 
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take as a subset, the top 250 customer's ratings of the top 250 movies. This subset has 
around 12% of the ratings missing. We then delete more data at random to evaluate the 
performance of the methods. In addition, for each customer in this subset ratings were 
deleted for movies corresponding to the unrated movies of a randomly selected customer 
with at least one rating out of the 250 movies. This leaves 74% of ratings missing. Figure 
[Sjcompares the performances of the TRCM methods to existing methods for both subset 
with both missing at random and missing in the pattern of the original values. 




L2:L2 L1:L1 SVD KNN 

Proportion of missing values (at random) 

Figure 3: Left: Comparison of the root MSE (RMSE) for a subset of the Netfiix data for 
TRCMAimpute, L2 : L2 and Li : Li, to KNNimpute and SVDimpute. A dense subset 
was obtained by ranking the movies and customers in terms of number of ratings and 
taking the top 250 movies and 250 customers. This subset has around 12% missing and 
additional values were deleted at random, up to 95%. With 95% missing, the RMSE 
of TRCMAimpute is 1.049 compared to 1.084 of the SVD and 1.354 using the movie 
averages. Right: Boxplots of absolute errors for the dense subset with missing entries 
in the pattern of the original data. Customers with at least one ranking out of the 250 
movies were selected at random and entries were deleted according to these customers 
leaving 74% missing. Quantiles of the absolute errors are shown at 50%, 75%, 95%, 
99%, and 99.9%. The RMSE of the methods are as follows L2 : L2: 1.005, Li : Li. 
1.029, SVD: 1.032, KNN: 1.184. 

Before discussing these results, we first make a note about the comparability of our 
errors rates to those for the Netfiix Prize (2 ). Because we chose the subset of data based 
on the number of observed ratings, we can expect the RMSE to be higher here than 
applying these methods to the full dataset. This method of obtaining a subset leaves 
out potentially thousands of highly correlated customers or movies that would greatly 
increase a method's predictive ability. In fact, the RMSE of the SVD method on the 
entire Netfiix data is 0.91 (j^, much less than the observed RMSE of 1.084 for the 
SVD on our subset with 95% missing. Thus, we can conjecture, that all of the methods 
we present would do better in terms of RMSE using the entire dataset than the small 
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subset on which we present results. 

The results indicate that TRCM imputation methods, particularly with L2 penal- 
ties, are competitive with existing methods on the missing at random data. At higher 
percentages of missing values, our methods perform notably well. With 95% missing 
values in our subset, TRCMAimpute has a RMSE of 1.049 compared to the SVD at 
1.084 and 1.354 using the movie averages. This is of potentially great interest for missing 
data imputation on a larger scale where the percentage of missing data is greater than 
98%. We note that at smaller percentages of missing values, marginal models penaliz- 
ing the movies were often chosen by cross-validation, indicating that the movies may 
have more predictive power. Whereas, at larger percentages of missing values, cross- 
validation chose to penalize both the rows and the columns, indicating that possibly 
more information can be gleaned from few observed values using transposable methods. 

Our methods also preform well when the data is missing in the same pattern as 
the original. The L2 : L2 method had the best results with a RMSE of 1.005 followed 
by Li : LI with 1.029, SVD with 1.032 and fc-nearest neighbors with 1.184. From 
the boxplots of absolute values Figure [3] (right), we see that the SVD has many large 
outliers in absolute value, while the Li penalties led to the fewest number of large errors. 
Since leading imputation methods for the Netflix Prize, are ensembles of many different 
methods ([TJ, we do not believe that TRCM methods alone would outperform ensemble 
methods. If, however, our methods outperform other individual methods, they could 
prove to be beneficial additions to imputation ensembles. 

6 Discussion 

We have formulated a parametric model for matrix-data along with computational 
advances that allow this model to be applied to missing value estimation in high- 
dimensional datasets with possibly complex correlations between and among the rows 
and columns. 

Our MCECM and one-step approximation imputation approaches are restricted to 
datasets where for each pair of rows, there is at least one column in which both entries 
are observed and vice versa for each pair of columns. A major drawback of TRCM impu- 
tation methods is computational cost. First, RCMimpute using the columns as features 
costs 0{p^). This is roughly on the order of other common imputation methods such 
as the SVDimpute which costs 0{np^). Our one-step approximation, TRCMAimpute, 
using the computations for the Alternating Conditional Expectations algorithm given 
inlEl costs 0(X]r=i ™'^{l'^il> + l^il}^ + where \mi\ and 

joijare the number of missing and observed elements of row i respectively. 

The main application of this paper has been to missing value imputation. We note 
that this is separate from the matrix-completion methods via convex optimization of 
Candes and Recht (|5|), which focuses on matrix-reconstruction instead of imputation. 
Also, we have presented a single imputation procedure, but our techniques can easily 
be extended to incorporate multiple imputation. We present a repeated imputations 
approach by taking samples from the posterior distribution (|28ll with the Bayesian one- 
step approximation in the[D] In addition, we have not discussed ultimate use or analysis 
of the imputed data, which will often dictate the imputation approach. Our imputation 
methods form a foundation that can be extended to further address these issues. 

We also pause to address the appropriateness of the Kronecker product covariance 
matrix to model the covariances observed in real data. While we do not assume that this 
particular structure is suitable for all data, we feel comfortable using the model because 
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of its flexibility. Recall that all marginal distributions of the mean-restricted matrix- 
variate normal are multivariate normal. This includes the distribution of elements within 
a row or column, or the distribution of elements from different rows or columns. All of 
the marginals of a set of elements are given by the mean and covariance parameters of the 
elements' rows and columns, belong. Thus, our model says that the location of elements 
within a matrix determine their distribution, often a reasonable assumption. Also, if 
either the covariance matrix of the rows or the columns is the identity matrix, then 
we are back to the familiar multivariate normal model. This flexibility to flt numerous 
multivariate models and to adapt to structure within a matrix is an important advantage 
of our matrix-variate model. 

Transposable regularized covariance models may be of potential mathematical and 
practical interest in numerous flelds. TRCMs allow for non-singular estimation of the 
covariances of the rows and columns, which is essential for any application. Adding 
restrictions to the mean of the TRCM, allows one to estimate all parameters from 
a single observed data matrix. Also, introduction of efficient methods of calculating 
conditional distributions and expectations make this model computationally feasible 
for many applications. Hence, transposable regularized covariance models have many 
potential future uses in areas such as hypothesis testing, classiflcation and prediction, 
and data mining. 

A Imputation for Multivariate Data (RCMimpute) 

The EM algorithm for the multivariate normal has been traditionally used to impute 
missing values. This method is based on maximum likelihood estimation of parameters 
in the presence of missing values. Imputation is simply a side result 1^ . Many 
have noted that this approach is not well suited to high-dimensional data, especially 
when the number of features is large. Hence, a common remedy to this problem, seen 
predominately in image processing, is to penalize the log-Hkelihood, (fT^ . (HHI, (HU) 

m- 

Our method maximizes the penalized log-likelihood of the observed data given below. 
First, however, we assume that our data Xi ~ A) with i = 1 . . .n observations and 
p features, with values missing at random. We let Xi^ot denote the observed components 
of observation i and Xi^mi denote the missing components. For each observation, we 
partition the mean and covariance to correspond to the observed parts of observation i 
and denote these as fioi and ^oi,oi- 

The penalized observed data log- likelihood is: 

1 " 

i = l 

-piiA'Mr. (11) 

Notice that except for the last penalty term, this is the same observed log-likelihood 
that is maximized by the original multivariate normal EM algorithm (l2T]l . 

Missing data imputation is part of the E step of the algorithm in which the con- 
ditional expectation of the complete-data log-likelihood is taken given the current pa- 
rameter estimates as in un-penalized EM algorithms. This is denoted by Q{9\9'^^'>) = 
E (^(m, A) |Xo, A') , letting 9 = (/i. A) and where Xo denotes the totality of observed 
elements of X. We can break this down into two parts which we term the imputation 
and covariance correction steps. These steps come directly from the conditional distri- 
bution formulas for the multivariate normal. 
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Imputation Step: 



[Xij, it J G Oi. 



Covariance Correction Step: E {xijXij> [xi^oi , fJ.', A') — Xi^Xi^r + 



3 

0, else. 

We call Ci_jj' the covariance correction term because it is added to the cross products 
forming the covariance matrix. Notice that Cijji is only non-zero if both j and j' are 
missing. Intuitively, this correction term is needed because in the imputation step, we 
set the missing values equal to their conditional expectations and hence lose some of 
the variance associated with these elements. 

The Maximization step is where our algorithm differs from that of Little and Rubin 
(pTjl . In the M step, we must maximize Q{9\9^'''') with respect to 6 to obtain the new 
estimate 9^''~^^\ giving 

Q(0|0«) = |logi 1 - Itr (A' A-^) - pll A-i ir, 

n 

where Aj^, = ^ [(i^- - Hj){xij - Hj) + Cijj>] . 

i = l 

The computations in the E step come into Q through A . Maximizing with respect 
to /i, gives the estimate fij — X^ILi ~ S:ij/n. Replacing /i with /I in A , we see that 
Q has the structure of the regularized covariance models. Hence, our estimate of A is 
obtained by applying either the Li or L2 solvers of the RCM problem. We break this 
M step into two parts which we present in the imputation algorithm. Algorithm [4] 

Algorithm 4 Imputation for Regularized Covariance Models (RCMimpute) 

1. Initialization: 

(a) Set the missing values to the mean: Xi^rni = Sieo ^ij/'^i 

(b) Set fi^'^^ and A to the empirical mean and covariance. 

2. E Step: 

(a) Imputation: Compute E ^Xijjii^oi,/^*^'^'', A ^ 

(b) Cross Products: Compute E ^x^jXij' |a;i^oi , Z^*-'^'', A^'^'j . 

3. M Step: 

(a) Update Estimates: (ij & ^'jj'- 

(b) Maximize penalized log-likelihood with respect to A"^ to obtain the new 
estimate A. 

4. Repeat steps 2-3 until convergence. 
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This regularized covariance model imputation approach (RCMimpute) is closely re- 
lated to other penalized EM methods for missing value estimation. These algorithms 
give non-singular covariance estimates (iT6)l . thus enabling use of the EM framework 
when p > n. Also, our algorithm has a unified theoretical framework based on the 
regularized covariance models. 

B Covariance Estimation Results 

We investigate the accuracy of our transposable regularized covariance estimates 
through simulation using the KuUback-Leibler divergence as the metric. As the focus of 
this paper is on the application of our models to missing data imputation, we compare 
our covariance estimates to simple shrinkage estimates for completeness. We will assume 
that our data has mean zero an has previously been centered as in Proposition 1, and 
the KuUback-Leibler (K-L) divergence is given in Proposition |4] 

Proposition 4. The KuUback-Leibler divergence for the mean-restricted matrix- variate 
normal with mean zero is 

Eis.A) A) - e{±. A)] = |iog|i: 1 + ^iog| A a-^ I " ^ + l*-'' (^"' ^) (^"' ^) ■ 

Proof. See Section\G\ 
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Figure 4: Mean KuUback-Leibler divergence with standard errors verses the penalty pa- 
rameter p ( "rho" ) for 50 datasets of dimension 50 x 50 simulated from the matrix- variate 
normal distribution. Autoregressive (left) and blocked diagonal (right) covariance matri- 
ces are used as given in Section|B] The divergence is compared between two transposable 
regularized covariance estimates, L2 ■ L2 and Li : ii,with the penalty parameters on 
both covariance matrices equal to p and three families of shrinkage estimators, SE I, SE 
II, and SE III described in Section [B] also indexed by p. 

In Figure|4] we compare covariance estimates of the Li : Li and L2 : L2 transposable 
regularized covariance models to three other shrinkage covariance estimates parameter- 
ized by p using the KuUback-Lciblcr (K-L) distance. The data of dimension 50 x 50 was 
simulated under the matrix- variate normal model with autoregressive covariances, left, 
and blocked diagonal covariances, right. The covariances are as follows: 
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• Autoregressive: O.SI'-JI and A,j ^ O.el'-Jl. 

• Blocked diagonal: Ofl-diagonal elements of 5 x 5 blocks of S are 0.8 and of A, 
0.6. 

In the TRCM estimates, both penalty parameters, pr and Pc are set to p for compara- 
bility. The three comparison transposable shrinkage covariance estimators are straight- 
forward extensions of simple multivariate shrinkage estimates. 

• SE I: Denote the SVD of X by X = U D V'^, then 

- SsEi = U(D2+pI)uT, & AsEi = V(D2-hpI)VT. 

• SE II & III: Let S be the unpenalized MLE of S, i.e. S = XX'^/p and D be the 
unpenahzed MLE of A, D = X"^ X /n, then 

- tsEii^ p'-^I + {l-p)S, & AsEii =pi^I + (l-p)D. 

- SsE m = p diag(S') + (1 - p)S, & Ase m = p diag(D) + (1 - p)D. 

These covariance estimates are taken in the spirit of shrinkage estimators presented 
by Daniels and Kass ([7]), which either shrink the eigenvalues (SE I) or shrink towards a 
specific structure(SE II and SE III). 

According to the Kullback-Leibler distance, the L2 : L2 TRCM covariance estimates 
are more accurate than our comparison estimators and the Li : Li TRCM estimates 
even with underlying sparsity. This finding is not unexpected since from Theorem 1, 
we have a unique solution compared with the possibly many sub-optimal stationary 
points of the Li : Li TRCM solution, Proposition 2. From these results, we conjecture 
that the L2 : L2 TRCM estimates may lead to significant improvements in missing data 
imputation. 

C MCECM Algorithm Properties and Comparisons 

We present some properties and the rationale for structuring our imputation ap- 
proach with transposable models as a Multi-Cycle ECM algorithm. The MCECM al- 
gorithm is a special case of the ECM algorithm of Meng and Rubin (l23|l . where the 
Maximization step is split into several Conditional Maximization (CM) steps. Expecta- 
tion steps are inserted between the CM steps to form multi-cycles. Our algorithm uses 
a specific type of conditional maximization, maximization with respect to one block of 
coordinates, either S"^ or A"^. These ECM-type algorithms are, as Meng and Rubin 
say, a type of "extended GEM" (Generalized EM) algorithm ((5^ , and thus retain many 
of the properties of GEM algorithms. Most importantly, our algorithm is monotonic, 
increasing the observed likelihood each step, and converges to a stationary point of the 
observed likelihood function. 

We note that the ECM and also the MCECM algorithm can lead to a substantial 
computational savings in terms of the rate of convergence (P^ . We investigate this 
possibility with simulated data in Figure [5] where we compare the convergence of the 
MCECM and the ECM approach to imputation with transposable data. The data, 
25 X 25 with 20% of the data missing, is simulated from the matrix-variate normal with 
the autoregressive covariance parameters: 

• Autoregressive: = O.Sl*"-'! and A,j ^ O.el'-J'L 
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Figure 5: Comparison of convergence of the observed log-likelihood by iterations (left) 
and by time in seconds (right) of the two structural approaches to the transposable EM 
algorithm: the Multi-cycle ECM and the ECM algorithms. The MCECM algorithm 
took 6.01 seconds of CPU time and the ECM algorithm took 12.375 seconds of CPU 
time until convergence. The data, 25 x 25 with 20% missing values, was simulated from 
the matrix-variate normal distribution with autoregressive covariances given in Section 

o 



For consistency, we used the L2 '■ L2 algorithm with pr = Pc = ^ for both methods. From 
this simulation, we see that the MCECM algorithm converges in fewer iterations and 
in less time than the ECM algorithm. The MCECM algorithm reaches its maximum 
in 11 iterations, whereas the ECM algorithm needs 20 iterations. In our experience, 
the MCECM algorithm provides substantial computational savings in both real and 
simulated data. 



D Bayesian One- Step Approximation 

In discussing the computations involved in the Expectation step of the MCECM al- 
gorithm for imputation, we mentioned the possibility of extending the algorithm to a 
stochastic or stochastic approximation ECM-type algorithm. Here, we also note that 
the one-step approximation can be formulated as a stochastic one-step algorithm using 
Gibbs sampling. First, we present a blocked Gibbs sampler. Algorithm |5] for stochas- 
tically generating missing values from their posterior distribution, and then we discuss 
how this can be used in a Bayesian- type one-step approximation. 

The blocked Gibbs sampler generates all missing values in a row or a column as a 
group from their conditional distributions given in Theorem 2. Thus, a deterministic 
overlapping blocking scheme is used to update the elements that can lead to faster 
convergence (|26p . This algorithm converges to the stationary distribution of the missing 
values given the observed values, and thus can be used in place of the Alternating 
Conditional Expectation Algorithm in the final step of the approximation. We call this 
the Bayesian one-step approximation. The conditional distribution of the missing values 
can also be thought of as the posterior distribution from which repeated draws can form 
a set of repeated imputations in the multiple imputation framework (j28j) . 
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Algorithm 5 Blocked Gibbs Sampler for estimating E step calculations in TRCMim- 
pute or the final step of the Bayesian one-step approximation. 

1. For each row, i, with missing values: 

(a) Generate x|^j^^^' from g (^^i^m, I ^l|o', j ^^i,r) ' gi'^'^n in Theorem 2 (a). 

2. For each column, j, with missing values: 

(a) Generate X^+j' from g (x,„^,, | x[,^:^, X^^^^.) , given in Theorem 2 (b). 

3. Repeat Steps 1 and 2 until a stationary distribution is reached. 



E Computations for Alternating Conditional Expec- 
tations Algorithm 

The Alternating Conditional Expectations algorithm is the key component in the one- 
step approximation TRCMAimpute. Thus, computational costs are important espe- 
cially when applying this imputation algorithm to high dimensional data. We show 
that using properties of the Schur complement, the order of operations can be reduced 
from 0{n^ + p^) to ©(X^Li ™n{|TOi|, |oj|}^ -|- niin{|mj|, |oj |}^), where \mi\ and 

I Oil are the number of missing and observed elements of row i respectively. 

With both the Li and L2 TRCM penalties, the covariance estimates and their in- 
verses are easy to obtain from the computations of the graphical lasso ()T4| and the 
eigenvalue decompositions. We present the alternative forms using the Schur comple- 
ments of the row covariance estimate and its inverse of part (i) in Theorem 2. The 
forms are analogous for the columns in part (ii). Given S and its inverse 0, we have 
the following. 





®^.k ' 




M \ 


®k,r 


®k,k J 




V I j 



Thus, S;^,, - Sj,fc E'lfc^fc Efc^i = 1/ 0i,j, and E^^fc Ej, j, = -0i,fe/0j,i, meaning that 
r = A / according to the notation on Theorem 2. If we let ^ be the inverse of A 
and partition it according to and o^, then we have — o.r~^jj.ro. — 

^mi,nii/®i,i ^nd T ^^^Oi'^ Oi ,oi = " ,mi *m.,oi ■ Thus, in the second step of the 
theorem, if the number of missing elements in row i is less than the number of observed 
elements, this alternative form requires less computation. 

F Cross Validation for TRCMAimpute 

With real data, the TRCM penalty parameters Pr and must be estimated from the 
data. To this end, we use 5-fold cross validation in all simulations and examples. This 
is accomplished by randomly deleting 20% of the observed values in each fold, applying 
an imputation method, and then measuring the imputation error on the deleted values. 
When selecting penalty parameters, we always include the possibility of an infinite value 
giving a marginal, multivariate model obtained in the first step of our approximation 
method. Thus, for application of TRCMAimpute, a model penalizing only the rows. 
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only the columns, or both may be chosen by the cross-validation procedure. Also, we 
remind the reader that the four penalty types of the TRCM model are referred to as 
Lq^ : Lq^, or the penalty-type placed on the rows and then the penalty-type on the 
columns. 
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Figure 6: True MSE and 5-fold cross-validation estimated MSE with standard errors. 
Genes from the kidney cancer microarray data with all samples observed are taken and 
deleted at random with 10% missing. TRCMAimpute, L2 : L2 is used with an infinite 
penalty on the column covariances. Cross-validation works well for determining the best 
penalty parameter. 




In Figure [6] we give the estimated MSE from 5-fold cross validation and the true 
MSE. TRCMAimpute, L2 : L2 was used with an infinite penalty on the columns. This 
shows us that while, cross-validation may not estimate the true MSE well, it is fairly 
accurate as estimating the correct penalty parameter. 

G Proofs 

Proof of Proposition 1. Expanding the trace term of ^(M,S, A) in terms of /i and v 
and then taking partial derivatives, we get 

1^ = 2 S-i A-i -2 E-i (X A-i = 

av 

P f^i P 

A similar argument gives /t. □ 

Proof of Propoisition 2. Notice that the first three terms of ^(S, A) are differentiable 
and A) is continuous. Then, since A) is strictly concave in S"^ with A"^ fixed 
and in A"^ with S"^ fixed, maximization with respect to each gradient gives a unique 
coordinate-wise maximum. Thus, we have satisfied the conditions of Tseng, Theorem 
4.1 (c) (!31l), and block coordinate-wise maximization converges to a stationary point of 
£(S,A). ' □ 
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Proof of Theorem 1. Our proof that the solution in Theorem 1 maximizes the penaHzed 
log-Hkelihood with L2 penalties begins with the coordinate-wise gradients and shows 
that there is only one solution to these equations and it is thus the globally optimal 
solution. Throughout this proof we assume X is centered and let X = UD V'^ be the 
SVD of X and d = diag(D) . 

First, we can take the eigenvectors of S* and A* to be the left and right singular 
vectors of X respectively, because rearranging the gradients, gives 

pT.-AprTr^ = XA-^X^ 
nA-4pc = X^ S-^X. 

Thus, the eigenvectors of S and A must be equal to their respective quadratic forms. 
This gives only one solution for the eigenvectors which is the left and right singular 
vectors of X. Note that if rank(X) = r, then the last n — r eigenvectors of U and the 
last p — r of V are not unique. 

Now, given the eigenvectors of S* and A*, we can write our penalized log-likelihood 
function in terms of the eigenvalues [3 and 9 and the singular values d. 



i=l 



1 



4 



2^/3 
.7=1 - 



3 '^3 



1 



1 



Pc 



p 

j=i 3 



(12) 



Note that this is a biconvex function of -g and |. Hence, we can use sequential convex 



programming to minimize ( 12 I, but this may not converge to the global minimum, which 



we seek. Instead, wc use the coordinate- wise gradients of (12), adding n—p zeros to 9 
and d so they are of length n. They are 



Aprt 







nP,9f - d% - 4p,/3,; = 



for i — \ . . 
for i = 1 . . 



. n. 
. n. 



(13) 



Note that the global minimum must satisfy (13 1. The last n 
immediately known and equal to 2y^^. Also, if p> r, then the last p 
are 2w^. So, we concentrate on the first r values of 9 and (3. 



r values of f3* are 
r values of 9* 



Since the gradients, ( |13[ ), are quadratic equations, we can write it as one equation in 

This gives us the following fifth degree polynomial. 



terms of /3, by letting 9, = ^^t^ 



ft (ciPf + C2f3f + C3) 

for each eigenvalue indexed by i with coefficients 







(14) 



„(») 



32prPcP + d1{n - p), & = 4pr(rf* - 16/Or/Oc) 



The five solutions to (14 1 are and ± 



-4c 



(<)„(>) 

3 



But, we are looking for 



solutions that are real and positive in terms of both ft and 9i. Thus, we can immediately 
dismiss the zero solution and the two negative solutions. Now, notice that if df > 16prPc, 
then C3 > 0, and thus by Descartes' sign rule, there exists only one positive real root. 

This root is given by ft = -i / - 



-^/c 



2c V 



We now consider the case when 
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df < 16prPc and we have two positive real roots. The above root is obviously still feasible 



in this case. We check the other positive root given by Pi — 



2c<" 



— - — — to see 



if the corresponding 6i > 0. The numerator of 9i is always strictly positive, so for 0i to 
be feasible, pPf > Apr. Substituting in the possible root, (ii, we see that this inequahty 
does not hold. Hence, this root is infeasible, leaving us with only one feasible root. 

Therefore, we have only one feasible solution {Pi,di) for each i to the gradient equa- 
tions (13 1. Since the global solution must satisfy these gradient equations, we conclude 



that the root is the unique global solution {P*,9*). We comment here that using it- 



erative coordinate descent to solve the quadratic equations ( 13 1 by taking the positive 



roots, converges to this globally optimal solution. This is true because any solution to 
the coordinate procedure must satisfy the coordinate-wise gradient conditions, which 
we have just proven to have only one solution. Thus, this is a rare instance when the 
coordinate descent solution to a biconvex problem converges to the global minimum. □ 



Proof of Proposition 5 

E(S,A 



^(S,A)-£(S,A) 
1 



|(log|S-i|-log|S-i 



(log|A-i|-log|A-i 



-E 



(S,A) 



tr(S-iXA-iX'^) 



tr 



XA X^ 



= |log|SS-i| + ^log|AA-i 



XA 



tr (S-i E(s,A) [X A-i X^] ) - tr (S-iE(s,a) 
tr(S"i [tr(A-i A)S]) -tr (S^^ tr(AA"^)S 
tr(S-i S)tr(A-i A) - tr(E"^ S)tr(A"^ A) 

Note that E(s,a) [XAX^] = tr(A A^) S where A is p x p (fT7)l . 
Proof of Proposition 4. We first show that E [tr (X"^ S'^ X A'^) \Xo, 0'] 



1 r 

2 

1 r 
2 



^log|SS-i| + -log|AA-i 



tr 



(X^ S-iX-hG(S-i)) A-i 



Let A = X'T S-i X, then, 

E [tr (X'^ X A-i) \Xo, 9'] ^ tr [E (A \Xo, 0') A'^] 



□ 



And, E(A,,v \Xoe') = E (X^ S"! 6' 

n n 



E 



_k=i t=i 

n n 



tk 



fe=i t=i fc=i t=i 

X^j X,j, + tr ( C^JJ") 
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Thus, E(A|Xo,6'') = X^S-iX + G(S-i). 



The calculation showing E [tr (X^ S"! X A'^) \Xo, 6'] = tr (x A'^ X^ + F(A-i)) Yr^ 
is the analogous to the above calculation with B = X A"^ X''" and 

E (B,,, \Xo, e') - X„ A-1 Xj,, + tr (d^"'') A'^ 



□ 

Proof of Theorem 2. Notice that Theorem 2 gives the conditional distributions in a two 
step process. The first step involves finding the distribution of a row or column condi- 
tional on the rest of the matrix. Using the notation given, this means (Xj ^ | r) — 
(X,,,„,,X,,o, \Xk,r) ~ N{i^,T) and (X,,, |X,,0 = (X„^,,,X„^,, ~ N{r^',^). These 

are given in Gupta and Nagar ()17p . Given the conditional distributions of a row or col- 
umn, the second step gives the distribution of a set of elements, or rrij within a 
row or column respectively. Notice that these conditional distributions come directly 
from the conditional distribution formulas for partitioned multivariate normal matrices, 
given the partitions of ip, rj, T and Hence, the connection between the two steps is 
all that is needed. We show this for part (a) and the proof for part (b) is analogous. 

If we let W — 'X.i jn., Y — X^ „. and Z — X^j and assume that they are centered 
so that vec(M^ F zj - N{0, n)', then notice that the first step gives p ((V, Y)\Z) and 
the second step gives p We show that this gives the desired conditional 

distribution. 

^ p{W,Y\Z) 

p{Y\Z) 
= p{W\Y,Z). 

Since matrix normal random variables can be written as multivariate normal random 
variables, we have estabUshed that the two steps give the desired conditional distribu- 
tion. □ 

Proof of Theorem 3. We will show that the iterates in Steps 2 and 3 of the Alternating 
Conditional Expectations Algorithm are the block Gauss-Seidel iterates solving the 
linear system giving E(X„j|Xo). For simplicity, we consider X2X2 with vec(X) = 
[xi X2 x^ x^ ^ 7V(0,r2) where xi, X2 are missing and X3, X4 are observed. If we 
partition according to indices of vec(X), the fc + 1 iteration of Step 2 or 3 is then 

1 T 



„(*=+!) 



= c 



(fc) 

X2 X^ Xi 



where Cixs = ^x,,^x^ ^/x,,^x, 
U x\ X3 X4 
where Dix3 = rJ:r2#x2 ^^^^2,#:e2 (1^) 



Define A = ( ) and & = ( ) where 61 = C2-3 [^3 X4]^ and bi = 

D2:3 [xs X4] . Then, solving the linear system A [xi X2] — b gives the conditional 
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expectation, A~ 6 = E ([xi X2] | [0:3 1 = r2i:2.3:4 r2j^4 3.4 [^3 X4] . In addition, 
the Gauss-Seidel iterates, 



1 - ( Oi - ai2X^ M /an 
= {^2 - 0212:1''-'^ /a22 



(/c+ 



give back equations (15 1. Also note that since is positive definite, so is A, ensuring 
convergence of the algorithm. Thus, finding the conditional expectation of each individ- 
ual missing value in an iterative fashion converges to the conditional expectation of all 
missing values given the observed data. Steps 2 and 3 of the Alternating Conditional 
Expectation Algorithm, however, group the missing values by row or by column. Thus, 
these steps form the block Gauss-Seidel iterates which also converge, often with a faster 
rate of convergence (19). In addition, note that the groups of missing values by row and 
by column form overlapping blocks, or multi-splittings. These updating schemes have 
been shown to often converge faster than non-overlapping schemes in linear iterative 
methods (fl^ . 

□ 
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